Support hashing vectorizer inside a union #176

lopuhin · 2017-04-28T14:43:45Z

Also add an eli5.sklearn.invert_and_fit helper that inverts and fits a vectorizer (even if it's hiding in a union). The fix in 85c8a61 is not strictly related to this PR; I think it's an old issue revealed by the new tests.
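
To make the idea of "invert and fit, even inside a union" concrete, here is a minimal stdlib-only sketch. It is not eli5's implementation: `ToyHashingVec`, `ToyInvertedVec`, `toy_invert_and_fit` and `column_of` are hypothetical names, the "union" is just a list of (name, vectorizer) pairs, and the sign that a real hashing vectorizer attaches to each term is ignored.

```python
import zlib

N_COLUMNS = 16  # deliberately tiny, so hash collisions are possible


def column_of(term, n_columns=N_COLUMNS):
    """Deterministic stand-in for the hashing trick: term -> column index."""
    return zlib.crc32(term.encode("utf-8")) % n_columns


class ToyHashingVec:
    """Hypothetical stateless vectorizer: it cannot list its feature names."""

    def transform(self, doc):
        row = [0] * N_COLUMNS
        for term in doc.split():
            row[column_of(term)] += 1
        return row


class ToyInvertedVec:
    """Wraps a ToyHashingVec and remembers which terms hit each column."""

    def __init__(self, vec):
        self.vec = vec
        self.column_terms = {}

    def fit(self, docs):
        for doc in docs:
            for term in doc.split():
                self.column_terms.setdefault(column_of(term), set()).add(term)
        return self


def toy_invert_and_fit(vec, docs):
    """Invert hashing vectorizers, recursing into (name, vec) union lists."""
    if isinstance(vec, ToyHashingVec):
        return ToyInvertedVec(vec).fit(docs)
    if isinstance(vec, list):  # toy "union": a list of (name, vectorizer) pairs
        return [(name, toy_invert_and_fit(sub, docs)) for name, sub in vec]
    return vec  # anything else is returned unchanged


docs = ["hashing trick", "feature union"]
union = [("text", ToyHashingVec()), ("other", "not a hashing vectorizer")]
inverted = toy_invert_and_fit(union, docs)
```

After fitting, `inverted[0][1].column_terms` maps each column that was seen to the set of terms that hashed into it, while the non-hashing member of the union is passed through untouched.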

My primary motivation was supporting explanations of the deep-deep model and its predictions.

Fixes #16


kmike · 2017-04-28T18:31:19Z

docs/source/libraries/sklearn.rst

@@ -219,6 +219,17 @@ automatically; to handle HashingVectorizer_ and FeatureHasher_ for
    # and ``ivec`` can be used as a vectorizer for eli5.explain_weights:
    eli5.explain_weights(clf, vec=ivec)

+HashingVectorizer_ is also supported inside a FeatureUnion_:
+:func:`eli5.explain_prediction` handles this case automatically, and for
+:func:`eli5.explain_weights` you can use :func:`eli5.sklearn.invert_and_fit``


there is an extra ` at the end of the line

Right, thanks for spotting it, it's so tiny! Fixed in 94d81e9

Without this fix sklearn gives an error in sklearn/feature_extraction/hashing.py, line 142: TypeError: Expected bytes, got numpy.string_

codecov-io · 2017-04-28T21:02:09Z

Codecov Report

Merging #176 into master will increase coverage by 0.09%.
The diff coverage is 98.3%.

@@            Coverage Diff             @@
##           master     #176      +/-   ##
==========================================
+ Coverage   97.25%   97.34%   +0.09%     
==========================================
  Files          39       39              
  Lines        2405     2450      +45     
  Branches      452      464      +12     
==========================================
+ Hits         2339     2385      +46     
  Misses         34       34              
+ Partials       32       31       -1

| Impacted Files | Coverage Δ | |
|---|---|---|
| eli5/sklearn/__init__.py | `100% <100%> (ø)` | ⬆️ |
| eli5/sklearn/utils.py | `87.38% <100%> (-0.34%)` | ⬇️ |
| eli5/sklearn/unhashing.py | `96.95% <98.14%> (+2.12%)` | ⬆️ |

lopuhin · 2017-04-30T07:41:31Z

@kmike this is ready for review, could you please check it again? I updated the PR description and added a py2 fix.

This feature could be less complicated if InvertableHashingVectorizer returned feature names as strings rather than as lists of {'name': , 'sign': } dicts: in that case we would not need the code that is currently in _invhashing_union_feature_names_scale. But then we would lose the ability to format inverted features nicely, and maybe also to work with their signs, etc.
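
To illustrate the trade-off, a rough sketch of why the list-of-dicts form needs extra handling inside a union. The helpers `prefix_feature_name` and `format_feature` are hypothetical and are not the code in _invhashing_union_feature_names_scale; the `name__feature` prefix follows the naming convention that scikit-learn's FeatureUnion uses for plain string names, and applying it inside each dict is an assumption made for this example.

```python
def prefix_feature_name(prefix, feature):
    """Prefix a plain string, or every candidate in a list of name/sign dicts."""
    if isinstance(feature, str):
        return "{}__{}".format(prefix, feature)
    # A hashed column may correspond to several colliding terms,
    # each with its own sign, so every candidate is prefixed separately.
    return [
        {"name": "{}__{}".format(prefix, cand["name"]), "sign": cand["sign"]}
        for cand in feature
    ]


def format_feature(feature):
    """Render a feature for display, marking negatively signed candidates."""
    if isinstance(feature, str):
        return feature
    return " | ".join(
        ("(-)" if cand["sign"] < 0 else "") + cand["name"] for cand in feature
    )


# One hashed column where two terms collide with opposite signs.
hashed_column = [{"name": "cat", "sign": 1}, {"name": "dog", "sign": -1}]
prefixed = prefix_feature_name("text", hashed_column)
```

With plain strings the union code would only need the first branch; keeping the dicts costs the second branch but preserves the per-term sign that `format_feature` relies on.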

lopuhin · 2017-04-30T07:44:52Z

The main concern I have here is the name of the invert_and_fit function: it makes sense as a function in eli5.sklearn.unhashing, but maybe for eli5.sklearn it needs a more descriptive name? We'll be making it public, so it would be harder to change the name later.

kmike · 2017-05-02T10:17:29Z

eli5/sklearn/unhashing.py

+    """ Create an InvertableHashingVectorizer from hashing vectorizer vec
+    and fit it on docs. If vec is a FeatureUnion, do it for all
+    hashing vectorizers in the union.
+    Returns an InvertableHashingVectorizer, or a Union, or an unchanged vectorizer.


Union -> FeatureUnion?

Right, fixed in b89d0f3. Thanks!

kmike · 2017-05-02T10:46:51Z

Yeah, I think it should be named either invert_hashing_and_fit, or imported & documented consistently as sklearn.hashing.invert_and_fit. The former is probably easier.


lopuhin · 2017-05-02T12:49:06Z

Yes, invert_hashing_and_fit sounds much better than invert_and_fit, thanks!
Changed the name in b89d0f3

kmike · 2017-05-02T13:04:40Z

Looks good, thanks @lopuhin!

lopuhin · 2017-05-02T13:07:39Z

Thanks for the review, @kmike!

lopuhin added 10 commits January 23, 2017 19:54

Support hashing vec in a feature union

f6c31ad

Merge branch 'master' into union-hashing-vec-3

3be0a2f

Pass one more test with union + hashing vec

6b90ab3

Comment on hashing vec, union and feature names

bd23f71

hashing vectorizer in Union for explain_weights

239b414

Refactor, start a test

Add tests for weights and union with non-text features

2a73dab

Add mypy annotations

eab82d7

A helper for inverting and fitting a hashing vec

c2c08b2

Rm unused imports

071d7a3

Simplify code, handle recursive case

907818a

kmike reviewed Apr 28, 2017


lopuhin added 2 commits April 28, 2017 23:51

DOC remove extra quote

94d81e9

py2 fix type error with bytes vs. np.string

85c8a61

Without this fix sklearn gives an error in sklearn/feature_extraction/hashing.py, line 142: TypeError: Expected bytes, got numpy.string_

lopuhin changed the title ~~[WIP] Support hashing vectorizer inside a union~~ Support hashing vectorizer inside a union Apr 30, 2017

kmike reviewed May 2, 2017


A more descriptive name: invert_hashing_and_fit

b89d0f3

Thanks @kmike for suggestion! Also fix docstring: Union -> FeatureUnion

kmike merged commit 73f0ac2 into master May 2, 2017

lopuhin deleted the union-hashing-vec-3 branch May 2, 2017 13:07

kmike added this to the 0.6 milestone May 2, 2017
